R 数据处理

本节数据处理主要是针对data.frame数据结构,由最基本的方法到运用dplyr包,reshape包等对数据进行加工及运算。
下面通过具体实例学习R数据处理。

这是三家公司的营收与利润数额。如果大家想一步步跟随本文进行尝试,那么请将下列内容输入(或者直接复制加粘贴)到自己的R终端窗口当中:

fy <- c(2010,2011,2012,2010,2011,2012,2010,2011,2012);
company <- c("Apple","Apple","Apple","Google","Google","Google","Microsoft","Microsoft","Microsoft"); 
revenue <- c(65225,108249,156508,29321,37905,50175,62484,69943,73723);  
profit <- c(14013,25922,41733,8505,9737,10737,18760,23150,16978);
companiesData <- data.frame(fy, company, revenue, profit);
companiesData$fy <- as.factor(companiesData$fy);
companiesData;
str(companiesData)













fycompanyrevenueprofit
12010Apple6522514013
22011Apple10824925922
32012Apple15650841733
42010Google293218505
52011Google379059737
62012Google5017510737
72010Microsoft6248418760
82011Microsoft6994323150
92012Microsoft7372316978
'data.frame':    9 obs. of  4 variables:
 $ fy     : Factor w/ 3 levels "2010","2011",..: 1 2 3 1 2 3 1 2 3
 $ company: Factor w/ 3 levels "Apple","Google",..: 1 1 1 2 2 2 3 3 3
 $ revenue: num  65225 108249 156508 29321 37905 ...
 $ profit : num  14013 25922 41733 8505 9737 ...

1 向现有数据框添加column

通过将利润除以营收再乘以100的方式添加一个“margin”(利润率)column:

1.1
companiesData$margin <- (companiesData$profit / companiesData$revenue)*100;
companiesData;













fycompanyrevenueprofitmargin
12010Apple652251401321.48409
22011Apple1082492592223.94664
32012Apple1565084173326.66509
42010Google29321850529.00651
52011Google37905973725.6879
62012Google501751073721.3991
72010Microsoft624841876030.02369
82011Microsoft699432315033.09838
92012Microsoft737231697823.02945
companiesData$margin <- round(companiesData$margin, 1);
companiesData;













fycompanyrevenueprofitmargin
12010Apple652251401321.5
22011Apple1082492592223.9
32012Apple1565084173326.7
42010Google29321850529
52011Google37905973725.7
62012Google501751073721.4
72010Microsoft624841876030
82011Microsoft699432315033.1
92012Microsoft737231697823
1.2 transform 函数
companiesData <- transform(companiesData, margin = (profit/revenue) * 100);
companiesData













fycompanyrevenueprofitmargin
12010Apple652251401321.48409
22011Apple1082492592223.94664
32012Apple1565084173326.66509
42010Google29321850529.00651
52011Google37905973725.6879
62012Google501751073721.3991
72010Microsoft624841876030.02369
82011Microsoft699432315033.09838
92012Microsoft737231697823.02945
companiesData <- transform(companiesData, margin = round((profit/revenue) * 100, 1));
companiesData













fycompanyrevenueprofitmargin
12010Apple652251401321.5
22011Apple1082492592223.9
32012Apple1565084173326.7
42010Google29321850529
52011Google37905973725.7
62012Google501751073721.4
72010Microsoft624841876030
82011Microsoft699432315033.1
92012Microsoft737231697823
1.3 apply函数

dataFrame$newColumn <- apply(dataFrame, 1, function(x) { . . . } );
1代表对row应用函数,2代表对列应用函数

companiesData$margin <- apply(companiesData[,c('revenue', 'profit')], 1, function(x) { (x[2]/x[1]) * 100 });
companiesData













fycompanyrevenueprofitmargin
12010Apple652251401321.48409
22011Apple1082492592223.94664
32012Apple1565084173326.66509
42010Google29321850529.00651
52011Google37905973725.6879
62012Google501751073721.3991
72010Microsoft624841876030.02369
82011Microsoft699432315033.09838
92012Microsoft737231697823.02945
companiesData$margin <- mapply(function(x, y) round((x/y) * 100, 1)companiesData$profit, companiesData$revenue);
companiesData













fycompanyrevenueprofitmargin
12010Apple652251401321.5
22011Apple1082492592223.9
32012Apple1565084173326.7
42010Google29321850529
52011Google37905973725.7
62012Google501751073721.4
72010Microsoft624841876030
82011Microsoft699432315033.1
92012Microsoft737231697823
highestMargin <- companiesData[companiesData$margin == max(companiesData$margin),];
highestMargin





fycompanyrevenueprofitmargin
82011Microsoft699432315033.1
highestMargin <- subset(companiesData, margin==max(margin));
highestMargin





fycompanyrevenueprofitmargin
82011Microsoft699432315033.1
library(plyr);
highestProfitMargins <- ddply(companiesData, 'company', summarize, bestMargin = max(margin));
highestProfitMargins







companybestMargin
1Apple26.7
2Google29
3Microsoft33.1
myResults <- ddply(companiesData, 'company', transform, highestMargin = max(margin), lowestMargin = min(margin));
myResults













fycompanyrevenueprofitmarginhighestMarginlowestMargin
12010Apple652251401321.526.721.5
22011Apple1082492592223.926.721.5
32012Apple1565084173326.726.721.5
42010Google293218505292921.4
52011Google37905973725.72921.4
62012Google501751073721.42921.4
72010Microsoft62484187603033.123
82011Microsoft699432315033.133.123
92012Microsoft73723169782333.123
highestProfitMargins <- ddply(companiesData, 'company', function(x) x[x$margin==max(x$margin),]);
highestProfitMargins







fycompanyrevenueprofitmargin
12012Apple1565084173326.7
22010Google29321850529
32011Microsoft699432315033.1
companiesData[companiesDate\$margin==max(companiesData$margin),];

这部分代码本身并没有使用ddply(),因此得到的是最高全局利润率而非各公司最高利润率。不过由于匿名函数被加入到ddply()声明当中,而ddply()又已经按照公司名称对数据框进行了分割,所以返回的匹配行仍然符合按不同公司计算的要求。

要利用ddply()查看整个数据集中的最高利润率,而非根据公司category划分的子集,我们可以在第二个参数中输入NULL使分割factor变为无效:
highestProfitMargin <- ddply(companiesData, NULL, summarize, bestMargin = max(margin));
highestProfitMargin





.idbestMargin
1NA33.1
companiesOrdered <- companiesData[order(-companiesData$margin),];
companiesOrdered













fycompanyrevenueprofitmargin
82011Microsoft699432315033.1
72010Microsoft624841876030
42010Google29321850529
32012Apple1565084173326.7
52011Google37905973725.7
22011Apple1082492592223.9
92012Microsoft737231697823
12010Apple652251401321.5
62012Google501751073721.4
companiesData[order(companiesData$fy, -companiesData$margin),];
companiesData













fycompanyrevenueprofitmargin
72010Microsoft624841876030
42010Google29321850529
12010Apple652251401321.5
82011Microsoft699432315033.1
52011Google37905973725.7
22011Apple1082492592223.9
32012Apple1565084173326.7
92012Microsoft737231697823
62012Google501751073721.4













fycompanyrevenueprofitmargin
12010Apple652251401321.5
22011Apple1082492592223.9
32012Apple1565084173326.7
42010Google29321850529
52011Google37905973725.7
62012Google501751073721.4
72010Microsoft624841876030
82011Microsoft699432315033.1
92012Microsoft737231697823
library(reshape2);
companiesLong <- melt(companiesData, c("fy", "company"));
companiesLong































fycompanyvariablevalue
12010Applerevenue65225
22011Applerevenue108249
32012Applerevenue156508
42010Googlerevenue29321
52011Googlerevenue37905
62012Googlerevenue50175
72010Microsoftrevenue62484
82011Microsoftrevenue69943
92012Microsoftrevenue73723
102010Appleprofit14013
112011Appleprofit25922
122012Appleprofit41733
132010Googleprofit8505
142011Googleprofit9737
152012Googleprofit10737
162010Microsoftprofit18760
172011Microsoftprofit23150
182012Microsoftprofit16978
192010Applemargin21.5
202011Applemargin23.9
212012Applemargin26.7
222010Googlemargin29
232011Googlemargin25.7
242012Googlemargin21.4
252010Microsoftmargin30
262011Microsoftmargin33.1
272012Microsoftmargin23
companiesLong <- melt(companiesData, id.vars=c("fy", "company"),
                  measure.vars=c("revenue", "profit", "margin"),
                  variable.name="financialCategory", value.name="amount");
companiesLong































fycompanyfinancialCategoryamount
12010Applerevenue65225
22011Applerevenue108249
32012Applerevenue156508
42010Googlerevenue29321
52011Googlerevenue37905
62012Googlerevenue50175
72010Microsoftrevenue62484
82011Microsoftrevenue69943
92012Microsoftrevenue73723
102010Appleprofit14013
112011Appleprofit25922
122012Appleprofit41733
132010Googleprofit8505
142011Googleprofit9737
152012Googleprofit10737
162010Microsoftprofit18760
172011Microsoftprofit23150
182012Microsoftprofit16978
192010Applemargin21.5
202011Applemargin23.9
212012Applemargin26.7
222010Googlemargin29
232011Googlemargin25.7
242012Googlemargin21.4
252010Microsoftmargin30
262011Microsoftmargin33.1
272012Microsoftmargin23
companiesWide <- dcast(companiesLong, fy + company ~ financialCategory, value.var="amount");
companiesWide













fycompanyrevenueprofitmargin
12010Apple652251401321.5
22010Google29321850529
32010Microsoft624841876030
42011Apple1082492592223.9
52011Google37905973725.7
62011Microsoft699432315033.1
72012Apple1565084173326.7
82012Google501751073721.4
92012Microsoft737231697823

dcast()会将long数据框的名称作为第一项参数。大家需要遵循以下语法创建公式来充当第二项参数:
id variables ~ variable variables
其中ID与measurement变量由波浪线隔开。如果波浪线两端的变量不止一个,那么各变量彼此之间要用“加号”分隔。
第三项参数用于让dcast()将包含measurement值的列名称分配给value.var。

第二节将介绍data.table包,这是一个功能更加强大,速度更快的数据处理包。